Use of this document

This is a study note for using \(lmerTest\) package for Linear Mixed-Effects Models/Regression (LMER).

1. Introduction

The modeling framwork for random effect is Linear Mixed-Effects Models, or the following equivalent name in different disciplines:

(source: Xie (2010) Regression Analysis (Chinese), CH16)

1.1 Random effect for Nested data

Let’s quickly introduce random effect and fixed effect then explain why need to use LMM to deal with nest data.

  • Nested data: Level of one random effect only occur within one unique level of the higher. It often occurs in the design of experiments and in particular in the context of randomized experiments and randomized controlled trials. wiki

Without a random effect for each experimental unit, a one-to-one correspondence between observations and experimental units is assumed. Including random effects in a model is one way to account for a lack of independence among observations that might be expected based on the design of experiment. Here is the definitions of fixed effect and random effect:

  • Fixed effect: test for effect of parameters
  • Random effect: control for non-independence from nested or hierarchical structure. i.e. blocking, replicated sampling from individuals, a subset of many possible group that could be sampled.

1.2 Simpson’s paradox

In probability and statistics, Simpson's paradox refer to a trend appears in several different groups of data but disappears or reverses when these groups are combined. This result is often encountered in social-science and medical-science statistics and is particularly problematic when frequency data is unduly given causal interpretations. When not consider level for data where samples are not independent with levels, problems can be resulted in within-group regression, Between-group regression, and Total regression, shown in the folowing:

Simpson’s paradox

When imporperly consider consider level in a model, there two kind of statisical fallaies may occur:

  • Ecological Fallacy: a formal fallacy in the interpretation of statistical data that occurs when inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong. It could occur when when consider only cluster (lv.2) as the independent variable
  • Atomistic fallacy: the fallacy of drawing inferences regarding variability across units defined at a higher level based on data collected for units at a lower level. It could occur when consider only individual samples (lv.1) as the dependent variable

1.3 Linear Mixed-Effects Models

Linear Mixed-Effects Models assume that the data being analysed are drawn from a hierarchy of different populations whose differences relate to that hierarchy.

  • can improve parameter estimates: Because random effect are drawn from a commen distribution with a mean and standard deviation, infromation from all groups can be used to improve parameter estimates fror each individual group- a process called partial polling or shrinkage. (source: https://www.youtube.com/watch?v=QCqF-2E86r0 )

1.4 Minisum sample size by Degree of freedom

sample size requirement
Reference Minimum ratio of Group size/Sample size
Kreft (1996) 30/30
Hox (1998) 100/10

1.5 Intraclass Correlation Coefficient (ICC)

The ICC, or Intraclass Correlation Coefficient, can be very useful in many statistical situations, but especially so in Linear Mixed Models. The definition of ICC is the following:

\[ ICC = \rho = \frac{IntraclassVariance}{TotalVariance }\]

Why ICC is useful?

  • It can help you determine whether or not a linear mixed model is even necessary. If you find that the correlation is zero, that means the observations within clusters are no more similar than observations from different clusters. Go ahead and use a simpler analysis technique.
  • It can be theoretically meaningful to understand how much of the overall variation in the response is explained simply by clustering. For example, in a repeated measures psychological study you can tell to what extent mood is a trait (varies among people, but not within a person on different occasions) or state (varies little on average among people, but varies a lot across occasions).
  • It can also be meaningful to see how the ICC (as well as the between and within cluster variances) changes as variable are added to the model.

(Source: https://www.theanalysisfactor.com/the-intraclass-correlation-coefficient-in-mixed-models/)

Luke (2004)’s idea of whether using LMM:

  • Theoretic View: if the research is a hierachical topic;
  • Statistical View: if the analysis violate the independence assumption of residual (underfit due to missing variables);
  • Empirical View: if ICC is too high (variance explained by grouping to ignore the random effect of grouping variables).
Range of Intraclass coefficient to determine if apply LMM (Coben, 1988)
Intraclass correlation ICC if Applying LMM
low ICC < 0.059 No neccessary
moderate 0.059 < ICC < 0.138 Yes
high ICC > 0.139 Yes

ICC for unconditional and conditional models:

  • Usually, the ICC is calculated for the null model (“unconditional model”). Or it is called ICC(1)
  • However, according to Raudenbush and Bryk (2002) or Rabe-Hesketh and Skrondal (2012) it is also feasible to compute the ICC for full models with covariates (“conditional models”) and compare how much a level-2 variable explains the portion of variation in the grouping structure (random intercept). Or it is called ICC(2).

(source: https://www.rdocumentation.org/packages/sjstats/versions/0.17.4/topics/icc)

2. Fit model in R

In R, both \(lmer\) and \(lmerTest\) package can model Linear mixed-Effects models/regression. We will use \(lmerTest\) package because it can output significance of fixed effect.

fit = lmerTest::lmer(data = , formula = DV ~ Fixed_Factor_1 + (Random_intercept + Random slope | Random_Factor_1), ...)

Variable:

Symbol:

(source: https://github.com/usplos/Eye-movement-related/blob/master/Linear%20mixed%20model%20(Hierarchical%20linear%20model)%20using%20R%20and%20JAMOVI.md; https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)

2.1 R formula composition

R formula composition
Coefficient R formula Simplified form
Fix intercept lm(Y ~ 1 + X1 + X2, …) lm(Y ~ X1 + X2, …)
Random intercept lmer(Y ~ 1 + X1 + X2 + (1  + 1| group), …) lmer(Y ~ X1 + X2 + (1| group), …)
Fixed slope lmer(Y ~ 1 + X1 + X2 + (1 + 1 | group), …) lmer(Y ~ X1 + X2 + (1 | group), …)
Random slope lmer(Y ~ 1 + X1 + X2 + (1 + X1 | group), …) lmer(Y ~ X1 + X2 + (X1 | group), …)

(souce: https://zhuanlan.zhihu.com/p/63092231 )

2.2 Types of LMM

when talk about LMM, Random intercept is often applied. So only Fixed slope and Random slope are considered.

Types of LMM.
Model category Meaning Formula Equivalent form
Fixed slope model (nullmodel, interceptonlymodel, unconditionalmodel) Random intercept with fixed mean lmer(Y ~ (1 | g), …) lmer(Y ~ 1+ (1 | g), …)
Fixed slope model Random intercept with a priori means lmer(Y ~ 0 + offset(o) + (1 | g), …) lmer(Y ~ -1 + offset(o) + (1 | g), …)
Fixed slope model (crossed random effect) Intercept varying among g1 and g2 lmer(Y ~ (1 | g1) + (1 | g2), …) -
Fixed slope model (3-level nested model) Intercept varying among g1 (lv.3) and g2 (lv.2) within g1 (lv.3) lmer(Y ~ (1 | g1 / g2), …) lmer(Y ~ (1 | g1) + (1 | g1: g2), …)
Random slope model Uncorrelated random intercept and slope lmer(Y ~ X + ( X || g), …) lmer(Y ~ X + (1 | g) + (0 + X | g), …)
Random intercepts and slopes model (recommended if possible) Correlated random intercept and slope lmer(Y ~ X + ( X | g), …) -

(source: https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf )

(crossed random effect, 3-level nested model, source: https://www.youtube.com/watch?v=QCqF-2E86r0 )

2.4 Solution when model fails to converge

  • Not enought degree of freedom: When number of observation is less than number of random effect, then model cannot converge. Solution 1: simplify the model (i.e. removing the problematic random slopes) Solution 2: increase the number of iteration
    fit = lmerTest::lmer(data, formula, control = lmerControl(optCtrl = list(maxfun = 20000)))

  • Singular fit: the data cannot support the complex random effect structure? Solution: remove the most complex term in the random effect (i.e. random slope)

2.5 Model diagnositcs

  • residual vs. fitted for entire model: no trends, be normally distributed with equal variance;
  • residuals vs. each independent variable: no trends, be normally distributed with equal variance;
  • residual vs. fitted values with each random-effect group: be normall with equal variance within and between groups;
  • conditional mean of random effects: be normally distributed;

2.6 Other considerations

  • Random effect should have many levels (>5 said to be needed)
  • Varaibles that are strongly conllinear (strong correlation) can cause problems;
  • Structure of random effect should be specified correctly. This requires good understanding of your data and study design;
  • watch out for warning message about the model convergence

3. Effect analysis

Work flow of LMM analysis:

  1. Collect and prepare data, consider if nested data. (Theoretic View)
  2. Select the proper distribution for predictive variable for independence assumption of residual. It might lead to result in linear mixed-effect models for Gaussium distribution, or generalized linear mixed-effect model for other distribution (Statistical View)
  3. build null model, check ICC to determine suitability of LMM (Empirical View)
  4. include variables in lv.1 to add fixed slope. then compare the Lv1 effect size with previous model;
  5. include variables in lv.2 to add random slope. then compare the Lv2 effect size with previous model;
  6. evaluate assumption of the full model (Statistical View)

3.1 Random effect

  • use sjstats::icc() to calculate the intraclass Correlation Coefficient using the null model, which is used to determine if intraclass variable is large enought to apply LMM
# nullmodel = lmerTest::lmer(DV ~ 1+ (1|g), ...)
# sjstats::icc(nullmodel)
  • use summary() to view the the variance of random effect;
  • use lmerTest::ranova() to produce a ANOVA-like table for random effects to test significance of Lv.2 variables. It checks whether the model becomes significantly worse if a certain random effect is dropped (formally known as likelihood ratio tests), if this is not the case, the random effect is not significant.
# summary(fit1)
# lmerTest::ranova(fit1)

(source: https://www.rensvandeschoot.com/tutorials/lme4/)

3.2 Fixed effect: Main/Interaction effect

Similar to the GLM/OLS, use anova() to check if the main effects and interaction effect is significant in F-tests.

# anova(fit1)

3.3 Cross-level Interaction effects

In analysis with multilevel (i.e. at least two-level), two types of cross-level interactions could occur between the higher and lower level:

  • Compositional effects: inter-group (or inter-context) differences in an outcome (for example, disease rates) are attributable to differences in group composition (that is, in the characteristics of the individuals of which the groups are comprised)
  • Contextual effects: intra-group differences are attributable to the effects of group level variable or properties.

(source: https://jech.bmj.com/content/56/8/588)

3.4 Effect size

Effect size is the ratio of the variance of residual between improved model and the base model. For example, the effect size * 100% refers to the percentage of imporvement by adding variables into the new model.

\[ ES = \frac{variance_{base} - variance_{new}}{variance_{base}} \]

Range of effect size corresponding to levels of improvemnt as adding variables to the model (Cohen, 1988)
Effect size Improvement
0.2 ~ 0.15 weak
0.15 ~ 0.35 moderate
> 0.35 strong

4. Application of LMM in Randomized statistical experiments

In many research area, LMM is usually applied to the following common randomized statistical experiments The Depanding variable, Random effect, and Fixed effect will be listed out for type of experiments, respectively. During the model selection, it is important to consider whether:

4.1 Completely Randomized Design (CRD)

The main advantage of this design is randomization. The post-test comparison with randomized subjects controls for the main effects of history, maturation, and pre-testing; because no pre-test is used there can be no interaction effect of pre-test and X. Another advantage of this design is that it can be extended to include more than Controlled if necessary.

Completely Randomized Design

Completely Randomized Design

  • Depanding variable: Observational unit
  • Random effect: None
  • Fixed effect: treatment

4.2 Randomized Complete Block Design (RCBD)

The RCBD is the standard design for agricultural experiments where similar experimental units are grouped into blocks or replicates.

  • The number of blocks is the number of replications.
  • Treatments are assigned at random within blocks of adjacent subjects, each treatment once per block.
  • Any treatment can be adjacent to any other treatment, but not to the same treatment within the block
Randomized Complete Block Design

Randomized Complete Block Design

  • Depanding variable: Observational unit
  • Random effect: Blocking
  • Fixed effect: treatment

4.3 Generalized Randomized Block Design (GRBD)

Like a randomized complete block design (RCBD), a GRBD is randomized. Within each block, treatments are randomly assigned to experimental units: this randomization is also independent between blocks. In a (classic) RCBD, however, there is no replication of treatments within blocks.

  • Depanding variable: Observational unit
  • Random effect: Experimental unit (individual), Blocking
  • Fixed effect: Treatment

5. Project: Healthcare Rate over the past 7 years

library(lmerTest)
library(ggplot2)
library(readr)
library(dplyr)
library(kableExtra)
library(lme4)

The data source is Health Insurance Exchange Public Use Files

LMM_rate_puf
In [1]:
# packages
## system
import os, warnings, sys, pathlib, glob
from pathlib import Path
warnings.filterwarnings('ignore')

## data structure
import pandas as pd

## utils
import zipfile

## graphing
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.collections import PatchCollection
from matplotlib.patches import Rectangle
%matplotlib inline
In [2]:
# create temporary location
directory_to_extract_to = './tmp_PUF'
if not os.path.isdir(directory_to_extract_to):
    os.mkdir(directory_to_extract_to)
In [3]:
# unzip function
def unzip_file(path_to_zip_file, directory_to_extract_to):
    with zipfile.ZipFile(path_to_zip_file[0], 'r') as zip: 
        # printing all the contents of the zip file 
        zip.printdir() 
        zip.extractall(path = directory_to_extract_to) 
In [4]:
# looping over year
rate = pd.DataFrame()
for year in range(2014,2021):
# for year in range(2014,2016):
    # locate .zip file
    zipfile_name = 'rate_puf'
    zipfile_name_y_zip = zipfile_name +'_'+ str(year) + '.zip'
    zipfile_name_y_csv = directory_to_extract_to + '/' + zipfile_name + '.csv'
    path_to_zip_file = list(Path('.').rglob(zipfile_name_y_zip))
    
    # unzipping
    unzip_file(path_to_zip_file, directory_to_extract_to)
    tmp_r = pd.read_csv(directory_to_extract_to + '/' + zipfile_name +'.csv').drop(columns = ['RowNumber', 'IssuerId2', 'VersionNum'], 
                       errors='ignore')
    
    tmp_r = tmp_r.melt(tmp_r.columns[:12],
                       var_name='Plan',
                       value_name='Rate').dropna()
    
    # store data
    rate = pd.concat([rate, tmp_r])
    print('update DataFrame shape:',rate.shape)
    
    # clear tmp folder
    files = glob.glob(directory_to_extract_to +'/*')
    for f in files:
        os.remove(f)
        print("Removed:" + f)
        
print('Final dataframe:', rate.shape, rate.columns)
File Name                                             Modified             Size
Rate_PUF.csv                                   2014-10-21 23:35:30    721861439
update DataFrame shape: (5495326, 14)
Removed:./tmp_PUF/Rate_PUF.csv
File Name                                             Modified             Size
Rate_PUF.csv                                   2015-08-04 18:48:30    883648804
update DataFrame shape: (12045815, 14)
Removed:./tmp_PUF/Rate_PUF.csv
File Name                                             Modified             Size
Rate_PUF.csv                                   2016-07-29 13:08:48    786428415
update DataFrame shape: (17850245, 14)
Removed:./tmp_PUF/Rate_PUF.csv
File Name                                             Modified             Size
Rate_PUF.csv                                   2017-08-10 08:51:40    513337244
update DataFrame shape: (21621123, 14)
Removed:./tmp_PUF/Rate_PUF.csv
File Name                                             Modified             Size
Rate_PUF.csv                                   2018-07-26 12:05:18    379838083
update DataFrame shape: (24425103, 14)
Removed:./tmp_PUF/Rate_PUF.csv
File Name                                             Modified             Size
Rate_PUF.csv                                   2019-09-09 09:13:26    329010380
update DataFrame shape: (26926653, 14)
Removed:./tmp_PUF/Rate_PUF.csv
File Name                                             Modified             Size
Rate_PUF.csv                                   2020-08-04 14:57:32    318217453
update DataFrame shape: (29445900, 14)
Removed:./tmp_PUF/Rate_PUF.csv
Final dataframe: (29445900, 14) Index(['BusinessYear', 'StateCode', 'IssuerId', 'SourceName', 'ImportDate',
       'FederalTIN', 'RateEffectiveDate', 'RateExpirationDate', 'PlanId',
       'RatingAreaId', 'Tobacco', 'Age', 'Plan', 'Rate'],
      dtype='object')
In [6]:
# add 'PrimarySubscriber' variable
rate.loc[(rate['Plan']  =='IndividualRate') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan']  =='IndividualTobaccoRate') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan']  =='PrimarySubscriberAndOneDependent') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan']  =='PrimarySubscriberAndTwoDependents') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan']  =='PrimarySubscriberAndThreeOrMoreDependents') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan']  =='Couple') ,'PrimarySubscriber'] = 'Couple'
rate.loc[(rate['Plan']  =='CoupleAndOneDependent') ,'PrimarySubscriber'] = 'Couple'
rate.loc[(rate['Plan']  =='CoupleAndTwoDependents') ,'PrimarySubscriber'] = 'Couple'
rate.loc[(rate['Plan']  =='CoupleAndThreeOrMoreDependents') ,'PrimarySubscriber'] = 'Couple'

# add 'Dependents' variable
rate.loc[(rate['Plan']  =='IndividualRate') ,'Dependents'] = 'Zero'
rate.loc[(rate['Plan']  =='IndividualTobaccoRate') ,'Dependents'] = 'Zero'
rate.loc[(rate['Plan']  =='PrimarySubscriberAndOneDependent') ,'Dependents'] = 'One'
rate.loc[(rate['Plan']  =='PrimarySubscriberAndTwoDependents') ,'Dependents'] = 'Two'
rate.loc[(rate['Plan']  =='PrimarySubscriberAndThreeOrMoreDependents') ,'Dependents'] = 'ThreeOrMore'
rate.loc[(rate['Plan']  =='Couple') ,'Dependents'] = 'Zero'
rate.loc[(rate['Plan']  =='CoupleAndOneDependent') ,'Dependents'] = 'One'
rate.loc[(rate['Plan']  =='CoupleAndTwoDependents') ,'Dependents'] = 'Two'
rate.loc[(rate['Plan']  =='CoupleAndThreeOrMoreDependents') ,'Dependents'] = 'ThreeOrMore'

# update 'Tobacco' variable
rate.loc[(rate['Plan']  =='IndividualRate') & (rate['Tobacco']  == 'Tobacco User/Non-Tobacco User') ,'Tobacco'] = 'Non-Tobacco User'
rate.loc[(rate['Plan']  =='IndividualTobaccoRate') & (rate['Tobacco']  == 'Tobacco User/Non-Tobacco User') ,'Tobacco'] = 'Tobacco User'

print(rate['PrimarySubscriber'].unique(), rate['Dependents'].unique(), rate['Tobacco'].unique())
['Individual' 'Couple'] ['Zero' 'One' 'Two' 'ThreeOrMore'] ['No Preference' 'Non-Tobacco User' 'Tobacco User']
In [7]:
# store the cleaned dataset
rate = rate.drop(columns = ['Plan', 'SourceName', 'ImportDate', 'FederalTIN', 'RateEffectiveDate', 'RateExpirationDate', 'PlanId'], errors='ignore')
rate = rate[['IssuerId', 'BusinessYear', 'StateCode', 'RatingAreaId', 'Tobacco', 'Age', 'PrimarySubscriber', 'Dependents', 'Rate']].reset_index(drop=True)
rate.to_csv('rate.csv', index=False)
In [68]:
# read in the data if need
# rate = pd.read_csv('rate.csv')
rate.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29445900 entries, 0 to 29445899
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   IssuerId           int64  
 1   BusinessYear       int64  
 2   StateCode          object 
 3   RatingAreaId       object 
 4   Tobacco            object 
 5   Age                object 
 6   PrimarySubscriber  object 
 7   Dependents         object 
 8   Rate               float64
dtypes: float64(1), int64(2), object(6)
memory usage: 2.0+ GB
In [71]:
rate.describe().T
Out[71]:
count mean std min 25% 50% 75% max
IssuerId 29445900.0 50559.750190 26593.272878 10046.0 28448.00 48129.00 73836.0 99969.0
BusinessYear 29445900.0 2016.319886 1.873248 2014.0 2015.00 2016.00 2018.0 2020.0
Rate 29445900.0 2015.012465 40241.166154 0.0 32.06 339.03 560.0 999999.0
In [113]:
# high pay insurance
rate.loc[(rate['Rate'] > 9000.0), ]
sns.distplot(rate['Rate'][rate['Rate'] > 9000.0])
Out[113]:
<AxesSubplot:xlabel='Rate', ylabel='Density'>
In [2]:
rate = pd.read_csv('rate.csv')
rate.sample(n=100000, random_state=1).to_csv('rate_100000.csv', index=False)
In [ ]:
 
# input data
rate <- readr::read_csv("rate_100000.csv") 
rate$ifFamilyOption <- if_else(rate$Age == 'Family Option', 'Y', 'N') 
rate$ifDependents <- if_else(rate$Dependents == 'Zero', 'N', 'Y') 
rate$ifTobaccoPreference <- if_else(rate$Tobacco == 'No Preference', 'N', 'Y') 
rate$Age <- plyr::revalue(rate$Age, c("Family Option"="0", "65 and over"="64", "64 and over"="64"))  %>% as.numeric() # %>% as.factor()
rate$Dependents <- plyr::revalue(rate$Dependents, c("ThreeOrMore"="3", "Two"="2", "Zero"="0", "One"="1"))  %>% as.numeric() #  %>% as.factor()
rate <- rate %>% 
  mutate(Rate = Rate +1) %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(IssuerId = IssuerId %>% as.factor())
rate <- rate %>% filter(Rate < 7000)
rate %>% str() 
## tibble [99,656 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ IssuerId           : Factor w/ 968 levels "10064","10091",..: 610 63 426 426 391 490 191 807 52 63 ...
##  $ BusinessYear       : num [1:99656] 2015 2014 2016 2015 2020 ...
##  $ StateCode          : Factor w/ 40 levels "AK","AL","AR",..: 15 6 32 32 2 6 28 38 6 6 ...
##  $ RatingAreaId       : Factor w/ 67 levels "Rating Area 1",..: 56 4 21 35 45 9 8 4 6 41 ...
##  $ Tobacco            : Factor w/ 3 levels "No Preference",..: 3 1 2 3 3 1 3 1 1 2 ...
##  $ Age                : num [1:99656] 61 24 61 31 54 48 53 38 0 36 ...
##  $ PrimarySubscriber  : Factor w/ 2 levels "Couple","Individual": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Dependents         : num [1:99656] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Rate               : num [1:99656] 987 322 801 407 693 ...
##  $ ifFamilyOption     : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 2 1 ...
##  $ ifDependents       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ifTobaccoPreference: Factor w/ 2 levels "N","Y": 2 1 2 2 2 1 2 1 1 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   IssuerId = col_double(),
##   ..   BusinessYear = col_double(),
##   ..   StateCode = col_character(),
##   ..   RatingAreaId = col_character(),
##   ..   Tobacco = col_character(),
##   ..   Age = col_character(),
##   ..   PrimarySubscriber = col_character(),
##   ..   Dependents = col_character(),
##   ..   Rate = col_double()
##   .. )

5.1 Exploratory data analysis

par(mfrow=c(1,2))

hist(rate$Rate)
hist(rate$Age)

boxplot(Rate ~ Age, data = rate)
boxplot(Rate ~ ifFamilyOption, data = rate)

boxplot(Rate ~ IssuerId, data = rate)
boxplot(Rate ~ BusinessYear, data = rate)

boxplot(Rate ~ StateCode, data = rate)
boxplot(Rate ~ RatingAreaId, data = rate)

boxplot(Rate ~ Tobacco, data = rate)
boxplot(Rate ~ ifTobaccoPreference, data = rate)

plot(Dependents ~ PrimarySubscriber, data = rate)
boxplot(Rate ~ PrimarySubscriber, data = rate)

boxplot(Rate ~ Dependents, data = rate)
boxplot(Rate ~ ifDependents, data = rate)

5.2 Null model

I construct the null model by testing any combination of the following three variables. Notice that IssuerId and StateCode might hive nested relation.

  • IssuerId
  • StateCode
  • RatingAreaId
null_model_lv2_v1 <- lmer(Rate ~ (1|RatingAreaId),
           data = rate) ; summary(null_model_lv2_v1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | RatingAreaId)
##    Data: rate
## 
## REML criterion at convergence: 1448766
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.3920 -0.9682 -0.1129  0.5240 14.2853 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  RatingAreaId (Intercept)   1625    40.31  
##  Residual                 120365   346.94  
## Number of obs: 99656, groups:  RatingAreaId, 67
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  377.537      5.335  54.365   70.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmerTest::ranova(null_model_lv2_v1)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ (1 | RatingAreaId)
##                    npar  logLik     AIC    LRT Df Pr(>Chisq)    
## <none>                3 -724383 1448772                         
## (1 | RatingAreaId)    2 -724514 1449032 261.98  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
null_model_lv2_v2 <- lmer(Rate ~ (1|StateCode),
                       data = rate) ; summary(null_model_lv2_v2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | StateCode)
##    Data: rate
## 
## REML criterion at convergence: 1443648
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0982 -0.7837 -0.1526  0.5140 14.6220 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  StateCode (Intercept)   6859    82.82  
##  Residual              114325   338.12  
## Number of obs: 99656, groups:  StateCode, 40
## 
## Fixed effects:
##             Estimate Std. Error     df t value Pr(>|t|)    
## (Intercept)   384.26      13.28  39.43   28.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmerTest::ranova(null_model_lv2_v2)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ (1 | StateCode)
##                 npar  logLik     AIC    LRT Df Pr(>Chisq)    
## <none>             3 -721824 1443655                         
## (1 | StateCode)    2 -724514 1449032 5379.6  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# fail to converge
null_model_lv3_v1 <- lmer(Rate ~ (1|StateCode/RatingAreaId),
                          data = rate) ; summary(null_model_lv3_v1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | StateCode/RatingAreaId)
##    Data: rate
## 
## REML criterion at convergence: 1442638
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2684 -0.7486 -0.1674  0.5097 14.6686 
## 
## Random effects:
##  Groups                 Name        Variance Std.Dev.
##  RatingAreaId:StateCode (Intercept)   1806    42.49  
##  StateCode              (Intercept)   6602    81.25  
##  Residual                           112520   335.44  
## Number of obs: 99656, groups:  RatingAreaId:StateCode, 438; StateCode, 40
## 
## Fixed effects:
##             Estimate Std. Error     df t value Pr(>|t|)    
## (Intercept)   380.30      13.34  38.69    28.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## convergence code: 0
## Model failed to converge with max|grad| = 0.00227963 (tol = 0.002, component 1)
lmerTest::ranova(null_model_lv3_v1)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ (1 | RatingAreaId:StateCode) + (1 | StateCode)
##                              npar  logLik     AIC    LRT Df Pr(>Chisq)    
## <none>                          4 -721319 1442646                         
## (1 | RatingAreaId:StateCode)    3 -721824 1443655 1010.6  1  < 2.2e-16 ***
## (1 | StateCode)                 3 -721518 1443041  397.4  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the best null model
null_model_lv2_v3<- lmer(Rate ~ (1|IssuerId),
                         data = rate) ; summary(null_model_lv2_v3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | IssuerId)
##    Data: rate
## 
## REML criterion at convergence: 1386404
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8126 -0.5543 -0.0405  0.1829 18.3573 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  IssuerId (Intercept) 61589    248.2   
##  Residual             62420    249.8   
## Number of obs: 99656, groups:  IssuerId, 968
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  241.758      8.363 984.955   28.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmerTest::ranova(null_model_lv2_v3)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ (1 | IssuerId)
##                npar  logLik     AIC   LRT Df Pr(>Chisq)    
## <none>            3 -693202 1386410                        
## (1 | IssuerId)    2 -724514 1449032 62624  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# singular
null_model_lv3_v2<- lmer(Rate ~ (1|IssuerId) + (1|StateCode/RatingAreaId),
                         data = rate) ; summary(null_model_lv3_v2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | IssuerId) + (1 | StateCode/RatingAreaId)
##    Data: rate
## 
## REML criterion at convergence: 1386094
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9067 -0.5533 -0.0623  0.1843 18.4046 
## 
## Random effects:
##  Groups                 Name        Variance Std.Dev.
##  IssuerId               (Intercept) 61703.1  248.40  
##  RatingAreaId:StateCode (Intercept)   500.4   22.37  
##  StateCode              (Intercept)     0.0    0.00  
##  Residual                           61981.8  248.96  
## Number of obs: 99656, groups:  
## IssuerId, 968; RatingAreaId:StateCode, 438; StateCode, 40
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  241.798      8.467 1024.318   28.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## convergence code: 0
## boundary (singular) fit: see ?isSingular
lmerTest::ranova(null_model_lv3_v2)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ (1 | IssuerId) + (1 | RatingAreaId:StateCode) + (1 | StateCode)
##                              npar  logLik     AIC   LRT Df Pr(>Chisq)    
## <none>                          5 -693047 1386104                        
## (1 | IssuerId)                  4 -721319 1442646 56544  1     <2e-16 ***
## (1 | RatingAreaId:StateCode)    4 -693202 1386412   310  1     <2e-16 ***
## (1 | StateCode)                 4 -693047 1386102     0  1     0.9998    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
list(null_model_lv2_v1, null_model_lv2_v2, null_model_lv3_v1, null_model_lv2_v3, null_model_lv3_v2) %>% 
  tibble() %>% 
  setNames('nullmodel') %>% 
  mutate(formula = nullmodel %>% 
           purrr::map_chr(. %>% .@call %>% .[2] %>% as.character() )) %>% 
  mutate(lv2_variance = nullmodel %>% 
           purrr::map_dbl(. %>% insight::get_variance_residual() %>%  .[[1]]%>% as.numeric() )) %>% 
  mutate(residual_variance = nullmodel %>% 
           purrr::map_dbl(. %>% insight::get_variance_intercept() %>% as.numeric() %>% sum() )) %>% 
  mutate(total_variance = lv2_variance + residual_variance) %>%
  mutate(ICC_adjusted = nullmodel %>% 
           purrr::map_dbl(. %>% performance::icc() %>% .[[1]]%>% as.numeric() )) %>%
  select(-nullmodel) %>% 
  kable() %>% kable_styling()
formula lv2_variance residual_variance total_variance ICC_adjusted
Rate ~ (1 | RatingAreaId) 120364.73 1624.755 121989.5 0.0133188
Rate ~ (1 | StateCode) 114324.65 6859.487 121184.1 0.0566038
Rate ~ (1 | StateCode/RatingAreaId) 112519.93 8408.009 120927.9 0.0695291
Rate ~ (1 | IssuerId) 62419.51 61588.629 124008.1 0.4966499
Rate ~ (1 | IssuerId) + (1 | StateCode/RatingAreaId) 61981.82 62203.556 124185.4 NA

The ICC show that IssuerId has strong effect on the parameter, and StateCode and StateCode/RatingAreaId has moderate effect. so I decide to use the three corresponding null model as the base model.

5.3 Fixed Slope Model

I am going to add fixed slopes to the models by including the following variables:

  • Age
  • Tobacco
  • BusinessYear
  • PrimarySubscriber
  • Dependents
  • ifFamilyOption
  • ifDependents
  • ifTobaccoPreference
# adding fixed effect

# not much impovement
fixslope_LMM_lv3_v1 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear  + (1|IssuerId) + (1|StateCode/RatingAreaId), data = rate) 
summary(fixslope_LMM_lv3_v1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +  
##     BusinessYear + (1 | IssuerId) + (1 | StateCode/RatingAreaId)
##    Data: rate
## 
## REML criterion at convergence: 1316459
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.8065 -0.5815 -0.0928  0.4768 21.8251 
## 
## Random effects:
##  Groups                 Name        Variance  Std.Dev. 
##  IssuerId               (Intercept) 4.773e+04 2.185e+02
##  RatingAreaId:StateCode (Intercept) 5.333e+02 2.309e+01
##  StateCode              (Intercept) 8.313e-05 9.118e-03
##  Residual                           4.121e+04 2.030e+02
## Number of obs: 97489, groups:  
## IssuerId, 966; RatingAreaId:StateCode, 438; StateCode, 40
## 
## Fixed effects:
##                               Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                 -5.434e+04  8.448e+02  9.724e+04  -64.32   <2e-16
## Age                          1.008e+01  4.861e-02  9.656e+04  207.31   <2e-16
## ifTobaccoPreferenceY         1.120e+02  2.337e+00  9.653e+04   47.91   <2e-16
## PrimarySubscriberIndividual -1.459e+02  1.104e+01  9.699e+04  -13.21   <2e-16
## ifDependentsY                3.099e+02  9.725e+00  8.957e+04   31.86   <2e-16
## BusinessYear                 2.692e+01  4.190e-01  9.724e+04   64.26   <2e-16
##                                
## (Intercept)                 ***
## Age                         ***
## ifTobaccoPreferenceY        ***
## PrimarySubscriberIndividual ***
## ifDependentsY               ***
## BusinessYear                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Age    ifTbPY PrmrSI ifDpnY
## Age         -0.069                            
## ifTbccPrfrY -0.023  0.007                     
## PrmrySbscrI  0.009 -0.065 -0.003              
## ifDepndntsY -0.057  0.153  0.009  0.487       
## BusinessYer -1.000  0.068  0.022 -0.022  0.049
## convergence code: 0
## boundary (singular) fit: see ?isSingular
lmerTest::ranova(fixslope_LMM_lv3_v1)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + 
##     BusinessYear + (1 | IssuerId) + (1 | RatingAreaId:StateCode) + 
##     (1 | StateCode)
##                              npar  logLik     AIC   LRT Df Pr(>Chisq)    
## <none>                         10 -658230 1316479                        
## (1 | IssuerId)                  9 -681632 1363282 46805  1     <2e-16 ***
## (1 | RatingAreaId:StateCode)    9 -658508 1317033   556  1     <2e-16 ***
## (1 | StateCode)                 9 -658230 1316477     0  1          1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v1)

# not much impovement
fixslope_LMM_lv3_v2 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId) + (1|RatingAreaId), data = rate) 
summary(fixslope_LMM_lv3_v2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +  
##     BusinessYear + (1 | IssuerId) + (1 | RatingAreaId)
##    Data: rate
## 
## REML criterion at convergence: 1316948
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.7243 -0.5871 -0.0930  0.4793 21.7949 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  IssuerId     (Intercept) 47731.2  218.5   
##  RatingAreaId (Intercept)   174.1   13.2   
##  Residual                 41596.6  204.0   
## Number of obs: 97489, groups:  IssuerId, 966; RatingAreaId, 67
## 
## Fixed effects:
##                               Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                 -5.472e+04  8.467e+02  9.732e+04  -64.62   <2e-16
## Age                          1.007e+01  4.878e-02  9.672e+04  206.50   <2e-16
## ifTobaccoPreferenceY         1.112e+02  2.342e+00  9.654e+04   47.48   <2e-16
## PrimarySubscriberIndividual -1.469e+02  1.108e+01  9.711e+04  -13.27   <2e-16
## ifDependentsY                3.098e+02  9.755e+00  8.959e+04   31.76   <2e-16
## BusinessYear                 2.711e+01  4.199e-01  9.731e+04   64.56   <2e-16
##                                
## (Intercept)                 ***
## Age                         ***
## ifTobaccoPreferenceY        ***
## PrimarySubscriberIndividual ***
## ifDependentsY               ***
## BusinessYear                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Age    ifTbPY PrmrSI ifDpnY
## Age         -0.069                            
## ifTbccPrfrY -0.022  0.007                     
## PrmrySbscrI  0.009 -0.065 -0.003              
## ifDepndntsY -0.057  0.153  0.009  0.487       
## BusinessYer -1.000  0.068  0.022 -0.021  0.049
lmerTest::ranova(fixslope_LMM_lv3_v2)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + 
##     BusinessYear + (1 | IssuerId) + (1 | RatingAreaId)
##                    npar  logLik     AIC   LRT Df Pr(>Chisq)    
## <none>                9 -658474 1316966                        
## (1 | IssuerId)        8 -685651 1371317 54354  1  < 2.2e-16 ***
## (1 | RatingAreaId)    8 -658508 1317031    68  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v2)

# not much impovement
fixslope_LMM_lv3_v3 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId) + (1|StateCode),  data = rate) 
summary(fixslope_LMM_lv3_v3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +  
##     BusinessYear + (1 | IssuerId) + (1 | StateCode)
##    Data: rate
## 
## REML criterion at convergence: 1317015
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.6799 -0.5873 -0.0932  0.4808 21.7522 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  IssuerId  (Intercept) 47695    218.4   
##  StateCode (Intercept)     0      0.0   
##  Residual              41665    204.1   
## Number of obs: 97489, groups:  IssuerId, 966; StateCode, 40
## 
## Fixed effects:
##                               Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                 -5.475e+04  8.472e+02  9.735e+04  -64.62   <2e-16
## Age                          1.007e+01  4.880e-02  9.677e+04  206.42   <2e-16
## ifTobaccoPreferenceY         1.110e+02  2.343e+00  9.658e+04   47.38   <2e-16
## PrimarySubscriberIndividual -1.471e+02  1.108e+01  9.713e+04  -13.27   <2e-16
## ifDependentsY                3.102e+02  9.761e+00  8.961e+04   31.78   <2e-16
## BusinessYear                 2.712e+01  4.201e-01  9.734e+04   64.56   <2e-16
##                                
## (Intercept)                 ***
## Age                         ***
## ifTobaccoPreferenceY        ***
## PrimarySubscriberIndividual ***
## ifDependentsY               ***
## BusinessYear                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Age    ifTbPY PrmrSI ifDpnY
## Age         -0.069                            
## ifTbccPrfrY -0.023  0.007                     
## PrmrySbscrI  0.009 -0.065 -0.003              
## ifDepndntsY -0.057  0.153  0.008  0.487       
## BusinessYer -1.000  0.068  0.022 -0.021  0.049
## convergence code: 0
## boundary (singular) fit: see ?isSingular
lmerTest::ranova(fixslope_LMM_lv3_v3)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + 
##     BusinessYear + (1 | IssuerId) + (1 | StateCode)
##                 npar  logLik     AIC   LRT Df Pr(>Chisq)    
## <none>             9 -658508 1317033                        
## (1 | IssuerId)     8 -682073 1364161 47130  1     <2e-16 ***
## (1 | StateCode)    8 -658508 1317031     0  1          1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v3)

# best model so far
fixslope_LMM_lv3_v4 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId), data = rate) 
summary(fixslope_LMM_lv3_v4)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +  
##     BusinessYear + (1 | IssuerId)
##    Data: rate
## 
## REML criterion at convergence: 1317015
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.6799 -0.5873 -0.0932  0.4808 21.7522 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  IssuerId (Intercept) 47695    218.4   
##  Residual             41665    204.1   
## Number of obs: 97489, groups:  IssuerId, 966
## 
## Fixed effects:
##                               Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                 -5.475e+04  8.472e+02  9.734e+04  -64.62   <2e-16
## Age                          1.007e+01  4.880e-02  9.677e+04  206.42   <2e-16
## ifTobaccoPreferenceY         1.110e+02  2.343e+00  9.658e+04   47.38   <2e-16
## PrimarySubscriberIndividual -1.471e+02  1.108e+01  9.713e+04  -13.27   <2e-16
## ifDependentsY                3.102e+02  9.761e+00  8.961e+04   31.78   <2e-16
## BusinessYear                 2.712e+01  4.201e-01  9.734e+04   64.56   <2e-16
##                                
## (Intercept)                 ***
## Age                         ***
## ifTobaccoPreferenceY        ***
## PrimarySubscriberIndividual ***
## ifDependentsY               ***
## BusinessYear                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Age    ifTbPY PrmrSI ifDpnY
## Age         -0.069                            
## ifTbccPrfrY -0.023  0.007                     
## PrmrySbscrI  0.009 -0.065 -0.003              
## ifDepndntsY -0.057  0.153  0.008  0.487       
## BusinessYer -1.000  0.068  0.022 -0.021  0.049
lmerTest::ranova(fixslope_LMM_lv3_v4)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + 
##     BusinessYear + (1 | IssuerId)
##                npar  logLik     AIC   LRT Df Pr(>Chisq)    
## <none>            8 -658508 1317031                        
## (1 | IssuerId)    7 -685770 1371555 54526  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v4)

I did not find any random slope-intercept model, so the full model remain as fixed slop model. The summary is shown at the following:

list(fixslope_LMM_lv3_v1, fixslope_LMM_lv3_v2, fixslope_LMM_lv3_v3, fixslope_LMM_lv3_v4) %>% 
  tibble() %>% 
  setNames('fixedslopemodel') %>% 
  mutate(formula = fixedslopemodel %>% 
           purrr::map_chr(. %>% .@call %>% .[2] %>% as.character() )) %>% 
  mutate(lv2_variance = fixedslopemodel %>% 
           purrr::map_dbl(. %>% insight::get_variance_residual() %>%  .[[1]]%>% as.numeric() )) %>% 
  mutate(residual_variance = fixedslopemodel %>% 
           purrr::map_dbl(. %>% insight::get_variance_intercept() %>% as.numeric() %>% sum() )) %>% 
  mutate(total_variance = lv2_variance + residual_variance) %>% 
  mutate(ICC_adjusted = fixedslopemodel %>% 
           purrr::map_dbl(. %>% performance::icc() %>% .[[1]]%>% as.numeric() )) %>% 
  select(-fixedslopemodel) %>% 
  kable() %>% kable_styling()
formula lv2_variance residual_variance total_variance ICC_adjusted
Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) + (1 | StateCode/RatingAreaId) 41212.47 48261.93 89474.40 0.5393937
Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) + (1 | RatingAreaId) 41596.61 47905.29 89501.89 0.5352433
Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) + (1 | StateCode) 41665.14 47694.91 89360.05 NA
Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) 41665.14 47694.91 89360.05 0.5337386

5.4 Evaluation

fixslope_LM <- lm(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear,
                   data = rate) ; summary(fixslope_LM)
## 
## Call:
## lm(formula = Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + 
##     ifDependents + BusinessYear, data = rate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -530.6 -176.5  -52.6  139.6 4651.1 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -3.466e+04  9.575e+02 -36.200  < 2e-16 ***
## Age                          1.015e+01  6.522e-02 155.574  < 2e-16 ***
## ifTobaccoPreferenceY         3.331e+02  1.781e+00 187.046  < 2e-16 ***
## PrimarySubscriberIndividual -1.153e+02  1.428e+01  -8.075 6.84e-16 ***
## ifDependentsY                2.483e+02  1.164e+01  21.334  < 2e-16 ***
## BusinessYear                 1.714e+01  4.748e-01  36.108  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 97483 degrees of freedom
##   (2167 observations deleted due to missingness)
## Multiple R-squared:  0.3825, Adjusted R-squared:  0.3824 
## F-statistic: 1.207e+04 on 5 and 97483 DF,  p-value: < 2.2e-16
anova(fixslope_LM)
## Analysis of Variance Table
## 
## Response: Rate
##                        Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Age                     1 1887697470 1887697470 25024.64 < 2.2e-16 ***
## ifTobaccoPreference     1 2502024051 2502024051 33168.59 < 2.2e-16 ***
## PrimarySubscriber       1   37987482   37987482   503.59 < 2.2e-16 ***
## ifDependents            1   28040030   28040030   371.72 < 2.2e-16 ***
## BusinessYear            1   98349237   98349237  1303.79 < 2.2e-16 ***
## Residuals           97483 7353488144      75434                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2)); plot(fixslope_LM)

# compare LM VS LMM using RMSE
anova(fixslope_LMM_lv3_v4, fixslope_LM)
## Data: rate
## Models:
## fixslope_LM: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + 
## fixslope_LM:     BusinessYear
## fixslope_LMM_lv3_v4: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + 
## fixslope_LMM_lv3_v4:     BusinessYear + (1 | IssuerId)
##                     Df     AIC     BIC  logLik deviance Chisq Chi Df Pr(>Chisq)
## fixslope_LM          7 1371569 1371636 -685778  1371555                        
## fixslope_LMM_lv3_v4  8 1317049 1317125 -658517  1317033 54522      1  < 2.2e-16
##                        
## fixslope_LM            
## fixslope_LMM_lv3_v4 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
performance::performance_rmse(fixslope_LMM_lv3_v4)
## [1] 203.1839
performance::performance_rmse(fixslope_LM)
## [1] 274.6432

The linear mixed model has significant lower RMSE than its ordinary linear model.

5.5 Possible impovement

The residual plot shown the model has issue of heteroscedasticity. One possible solution is to use the generalized version of LMM, for a quick example :

(Note: the GLLM is not successfully converge, need to fix later)

fullmodel_GLLM <- lme4::glmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId), data = rate %>% as.data.frame(), family = Gamma(link="log") )
summary(fullmodel_GLLM)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: Gamma  ( log )
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +  
##     BusinessYear + (1 | IssuerId)
##    Data: rate %>% as.data.frame()
## 
##       AIC       BIC    logLik  deviance  df.resid 
## 1182895.0 1182970.9 -591439.5 1182879.0     97481 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -1.784 -0.485 -0.086  0.379 43.002 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  IssuerId (Intercept) 1.2112   1.101   
##  Residual             0.3136   0.560   
## Number of obs: 97489, groups:  IssuerId, 966
## 
## Fixed effects:
##                               Estimate Std. Error t value Pr(>|z|)    
## (Intercept)                 -6.587e+01  1.080e+00  -60.97   <2e-16 ***
## Age                          1.908e-02  1.374e-04  138.86   <2e-16 ***
## ifTobaccoPreferenceY         3.835e-01  6.739e-03   56.91   <2e-16 ***
## PrimarySubscriberIndividual -4.919e-01  3.085e-02  -15.95   <2e-16 ***
## ifDependentsY                1.138e+00  2.703e-02   42.09   <2e-16 ***
## BusinessYear                 3.454e-02  5.366e-04   64.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Age    ifTbPY PrmrSI ifDpnY
## Age         -0.034                            
## ifTbccPrfrY  0.012  0.004                     
## PrmrySbscrI  0.011 -0.072  0.000              
## ifDepndntsY -0.043  0.135  0.001  0.363       
## BusinessYer -0.998  0.031 -0.014 -0.039  0.030
## convergence code: 0
## Model failed to converge with max|grad| = 0.00363449 (tol = 0.001, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
plot(fullmodel_GLLM)

fullmodel_GLM <- glm(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear,
                   data = rate, family = Gamma()) 
summary(fullmodel_GLM)
## 
## Call:
## glm(formula = Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + 
##     ifDependents + BusinessYear, family = Gamma(), data = rate)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0964  -1.4289  -0.1370   0.2854   2.9436  
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                  1.582e-01  6.901e-03   22.924  < 2e-16 ***
## Age                         -6.154e-05  5.864e-07 -104.936  < 2e-16 ***
## ifTobaccoPreferenceY        -2.505e-03  2.099e-05 -119.386  < 2e-16 ***
## PrimarySubscriberIndividual -2.401e-03  5.664e-04   -4.239 2.24e-05 ***
## ifDependentsY                4.228e-03  4.558e-04    9.277  < 2e-16 ***
## BusinessYear                -7.366e-05  3.412e-06  -21.591  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.9251532)
## 
##     Null deviance: 165050  on 97488  degrees of freedom
## Residual deviance: 133122  on 97483  degrees of freedom
##   (2167 observations deleted due to missingness)
## AIC: 1321925
## 
## Number of Fisher Scoring iterations: 7
anova(fullmodel_GLM)
## Analysis of Deviance Table
## 
## Model: Gamma, link: inverse
## 
## Response: Rate
## 
## Terms added sequentially (first to last)
## 
## 
##                     Df Deviance Resid. Df Resid. Dev
## NULL                                97488     165050
## Age                  1  13759.5     97487     151290
## ifTobaccoPreference  1  17505.8     97486     133784
## PrimarySubscriber    1    151.3     97485     133633
## ifDependents         1    104.4     97484     133529
## BusinessYear         1    407.1     97483     133122
car::crPlots(fullmodel_GLM)